Aim
R language and it’s packages ecosystem are wonderful tools for the data analysis. In community ecology field, many packages can be used for the ecological and evolutionary analysis, such as vegan[1], ape[2] and picante[3]. In microbial community ecology, with the development of the high-throughput sequencing techniques, the increasing data amount and complexity make the community data analysis a challenge. There has been a lot of R packages created for the community data analysis in microbial ecology, such as phyloseq[4], microbiomeSeq, ampvis2, mare and microbiome. However, we lack a flexible and modularized R package to analyze and manage all the data. Thus we created microeco R package for this goal. This package provide powerful plotting and statistical approaches.
Main Features
- R6 Class to store and analyze data; fast, flexible and modularized
- Plotting the taxonomic abundance
- Venn diagram
- Alpha diversity
- Beta diversity
- Differential abundance analysis
- Effects of environmental variables on community structure
- Network analysis
- Null model analysis
- Functional analysis
Class introduction
All the classes in microeco package depend on the R6 class. R6 uses the encapsulated object-oriented programming paradigm, which means that R6 is a profoundly different OO system from S3 and S4 because it is built on encapsulated objects, rather than generic functions. If you are interested in the class feature, read those (from book Advanced R):
A generic is a regular function so it lives in the global namespace. An R6 method belongs to an object so it lives in a local namespace. This influences how we think about naming. The methods belong to objects, not generics, and you call them like object$method().
R6’s reference semantics allow methods to simultaneously return a value and modify an object.
You invoke an R6 method using $, which is an infix operator. If you set up your methods correctly you can use chains of method calls as an alternative to the pipe. Every R6 object has an S3 class that reflects its hierarchy of R6 classes.
Details
microtable class
Many tools can be used to do the bioinformatic analysis, such as QIIME[5], usearch(https://www.drive5.com/usearch/), mothur[6], and RDP(http://rdp.cme.msu.edu/). Although the format of results may be different from different tools, the main files can be classified as the following parts: 1-OTU table, the traditional species-sample table; 2-taxonomy table, the OTU taxonomy assignments information table; 3-phylogenetic tree. Generally, it is necessary to create a sample information table to store all the detailed sample information, including the physicochemical data and the missing value (NA). The funtions read_qiime_otu_table() and split_assignments() in package qiimer are appropriate for the parsing of raw QIIME OTU table file.
We use microtable class to store all the required data for the following analysis. We use the example data in the package to show the tutorials. The dataset is the 16S sequencing results of wetland soils in China[7]. In this dataset, the env_data table is separate from the sample information table.
library(microeco)
# load the example data
data(otu_table)
data(taxonomy_table)
data(sample_info)
data(phylo_tree)
data(env_data)
# set.seed is used to fix the random number generation to make the results repeatable
set.seed(123)
# make the plotting background same with the tutorial
theme_set(theme_bw())[1] “data.frame”
| 101 | 102 | 103 | 104 | 105 | |
|---|---|---|---|---|---|
| OTU_4272 | 4 | 0 | 2 | 1 | 0 |
| OTU_236 | 3 | 0 | 0 | 0 | 3 |
| OTU_399 | 39 | 51 | 59 | 20 | 22 |
| OTU_1556 | 10 | 8 | 10 | 4 | 8 |
| OTU_32 | 37 | 41 | 3 | 19 | 39 |
[1] “data.frame”
| Kingdom | Phylum | Class | |
|---|---|---|---|
| OTU_4272 | k__Bacteria | p__Firmicutes | c__Bacilli |
| OTU_236 | k__Bacteria | p__Chloroflexi | c__ |
| OTU_399 | k__Bacteria | p__Proteobacteria | c__Betaproteobacteria |
| OTU_1556 | k__Bacteria | p__Acidobacteria | c__Acidobacteria |
| OTU_32 | k__Archaea | p__Miscellaneous Crenarchaeotic Group | c__ |
| Order | Family | Genus | Species | |
|---|---|---|---|---|
| OTU_4272 | o__Bacillales | f__ | g__ | s__ |
| OTU_236 | o__ | f__ | g__ | s__ |
| OTU_399 | o__Nitrosomonadales | f__Nitrosomonadaceae | g__ | s__ |
| OTU_1556 | o__Subgroup 17 | f__ | g__ | s__ |
| OTU_32 | o__ | f__ | g__ | s__ |
[1] “data.frame”
| SampleID | Group | Type | Saline |
|---|---|---|---|
| 1 | IW | NE | Non-saline soil |
| 2 | IW | NE | Non-saline soil |
| 3 | IW | NE | Non-saline soil |
| 4 | IW | NE | Non-saline soil |
| 5 | IW | NE | Non-saline soil |
[1] “data.frame”
| Latitude | Longitude | Altitude | Temperature | Precipitation |
|---|---|---|---|---|
| 52.96 | 122.6 | 432 | -4.2 | 445 |
| 52.95 | 122.6 | 445 | -4.3 | 449 |
| 52.95 | 122.6 | 430 | -4.3 | 449 |
| 52.95 | 122.6 | 430 | -4.3 | 449 |
| 52.95 | 122.6 | 429 | -4.3 | 449 |
[1] “phylo”
Then, we make an object use microtable class. Indeed, this operation is very similar with that in package phyloseq[4], but this is simpler and faster.
# As an example, we only use 90 samples, 30 samples in each group (CW, IW, TW), see "Group" column in sample_info table.
# obtain the sample names
use_sample_names <- lapply(unique(sample_info$Group), function(x) rownames(sample_info[sample_info$Group == x, ])[1:30]) %>% unlist
# In R6 class, '$new' is the original method used to create a new object of class
dataset <- microtable$new(sample_table = sample_info[use_sample_names, ], otu_table = otu_table, tax_table = taxonomy_table, phylo_tree = phylo_tree)
class(dataset)[1] “microtable” “R6”
microtable class: sample_table have 90 rows and 4 columns otu_table have 14113 rows and 200 columns tax_table have 14113 rows and 7 columns phylo_tree have 14096 tips
To make the species and sample information consistent across different files within the dataset object, we can use function tidy_dataset() to trim the dataset.
microtable class: sample_table have 90 rows and 4 columns otu_table have 13628 rows and 90 columns tax_table have 13628 rows and 7 columns phylo_tree have 13628 tips
Then, we remove OTUs which are not assigned in the Kingdom "k__Archaea" or "k__Bacteria".
dataset$tax_table %<>% base::subset(Kingdom == "k__Archaea" | Kingdom == "k__Bacteria")
print(dataset)microtable class: sample_table have 90 rows and 4 columns otu_table have 13628 rows and 90 columns tax_table have 13330 rows and 7 columns phylo_tree have 13628 tips
We also remove OTUs with the taxonomic assignments “mitochondria” or “chloroplast”.
# This will remove the lines containing the taxa word regardless of taxonomic ranks and ignoring word case in the tax_table.
# So if you want to filter some taxa not considerd pollutions, please use subset.
dataset$filter_pollution(taxa = c("mitochondria", "chloroplast"))
print(dataset)microtable class: sample_table have 90 rows and 4 columns otu_table have 13628 rows and 90 columns tax_table have 13296 rows and 7 columns phylo_tree have 13628 tips
Then, to make the OTUs same in otu_table, tax_table and phylo_tree, we use tidy_dataset() again.
microtable class: sample_table have 90 rows and 4 columns otu_table have 13296 rows and 90 columns tax_table have 13296 rows and 7 columns phylo_tree have 13296 tips
Then we use sample_sums() to check the sequence numbers in each sample.
[1] 10316 37087
Sometimes, in order to reduce the effects of species number on the diversity measurements, we need to perform the resampling to make the sequence number same across samples.
# As the example, we use 10000 sequences in each sample
dataset$rarefy_samples(sample.size = 10000)
dataset$sample_sums() %>% range[1] 10000 10000
Then, we calculate the taxa abundance at each taxonomic rank using cal_abund(). This function return a list called taxa_abund containing several data frame of the abundance information at each taxonomic rank. The list is stored in the object microtable automatically.
[1] “list”
If you want to save the taxa abundance file to the computer, use save_abund().
Then, we calculate the alpha diversity. The result is also stored in the object microtable automatically. As an example, we do not calculate phylogenetic diversity.
# If you want to add Faith's phylogenetic diversity, use PD = TRUE
dataset$cal_alphadiv(PD = FALSE)
# save dataset$alpha_diversity to the computer
dataset$save_alphadiv(dirpath = "alpha_diversity")We also calculate the beta diversity (distance matrix) using function cal_betadiv(). We provide four most frequently used indexes: Bray-curtis, Jaccard, weighted Unifrac and unweighted unifrac.
trans_abund class
This class is used to transform abundance data for plotting the taxa abundance. We first use this class for the bar plot. We use the highest 10 Phylum in dataset.
t1 <- trans_abund$new(dataset = dataset, taxrank = "Phylum", ntaxa = 10)
# t1 now include the transformed abundance data t1$abund_data and other elements for the following plottingWe do not retain the sample names in x axis and add the facet to show group.
The groupmean parameter can be used to obtain the group-mean barplot.
t1 <- trans_abund$new(dataset = dataset, taxrank = "Phylum", ntaxa = 10, groupmean = "Group")
t1$plot_bar(others_color = "grey70", legend_text_italic = FALSE)The box plot is a good way to intuitionally show data distribution across groups.
t1 <- trans_abund$new(dataset = dataset, taxrank = "Class", ntaxa = 15)
t1$plot_box(group = "Group")We also show the heatmap using the high abundant Genus
t1 <- trans_abund$new(dataset = dataset, taxrank = "Genus", ntaxa = 40)
t1$plot_heatmap(facet = "Group", xtext_keep = FALSE, withmargin = FALSE)trans_venn class
The trans_venn class is used for venn plotting. We first merge samples according to the “Group” column of sample_table.
# merge samples to one community for each group
dataset1 <- dataset$merge_samples(use_group = "Group")
t1 <- trans_venn$new(dataset1, ratio = "seqratio")
t1$plot_venn()
# The integer data is OTU number
# The percentage data is the sequence number/total sequence numberWhen the groups are too many to show using venn plot, we can use petal plot.
dataset1 <- dataset$merge_samples(use_group = "Type")
t1 <- trans_venn$new(dataset1)
t1$plot_venn(petal_plot = TRUE)Sometimes, we are interested in the particular and shared species. In another words, the composition of the particular or shared species may account for the different and similar parts of ecological characteristics across groups[8]. For this goal, we first transform the results of venn plot to the traditional species-sample table, that is, another object of microtable class.
dataset1 <- dataset$merge_samples(use_group = "Group")
t1 <- trans_venn$new(dataset1)
# transform venn results to the sample-species table, do not consider abundance, only use frequency.
t2 <- t1$trans_venn_com(use_OTUs_frequency = TRUE)
class(t2)[1] “microtable” “R6”
We use bar plot to show the composition at the Genus level.
# calculate taxa abundance, that is, the frequency
t2$cal_abund()
# transform and plot; $ is like the pipe %>%
trans_abund$new(dataset = t2, taxrank = "Genus", ntaxa = 10)$
plot_bar(bar_type = "part", legend_text_italic = T, ylab_title = "Frequency (%)", xtext_type_hor = FALSE)We also try to use pie chart to show the composition at the Phylum level.
trans_abund$new(dataset = t2, taxrank = "Phylum", ntaxa = 8)$
plot_pie(facet_nrow = 3, use_colors = rev(c(RColorBrewer::brewer.pal(8, "Dark2"), "grey50")))trans_alpha class
Alpha diversity can be transformed and plotted using trans_alpha class. Creating this class can return two data frame-alpha_data and alpha_stat.
| Group | Measure | N | Mean | SD | SE |
|---|---|---|---|---|---|
| CW | Observed | 30 | 1843 | 220.6 | 40.27 |
| CW | Chao1 | 30 | 2553 | 338.1 | 61.73 |
| CW | ACE | 30 | 2716 | 367 | 67.01 |
| CW | Shannon | 30 | 6.308 | 0.5355 | 0.09777 |
| CW | Simpson | 30 | 0.9897 | 0.01305 | 0.002382 |
Then, we test the differences among groups using the KW rank sum test and anova with multiple comparisons.
| Groups | Measure | Test method | p.value | Significance |
|---|---|---|---|---|
| IW vs CW | Observed | KW | 0.0371 | * |
| IW vs TW | Observed | KW | 0.4553 | |
| CW vs TW | Observed | KW | 0.3912 | |
| IW vs CW vs TW | Observed | KW | 0.155 | |
| IW vs CW | Chao1 | KW | 0.002689 | ** |
| Observed | Chao1 | ACE | Shannon | Simpson | InvSimpson | Fisher | Coverage | |
|---|---|---|---|---|---|---|---|---|
| IW | a | a | a | a | a | a | a | b |
| TW | a | ab | b | a | a | a | a | a |
| CW | a | b | b | a | a | a | a | a |
Now, let us plot the mean and se of alpha diversity for each group, and add the agricolae::duncan.test result.
We can also use the boxplot to show the paired comparisons directly.
trans_beta class
Beta diversity can be transformed and plotted using trans_beta class. The analysis referred to the beta diversity in this class mainly include ordination, group distance and manova. We first show the ordination using PCoA.
# we first create an object and select PCoA for ordination
t1 <- trans_beta$new(dataset = dataset, group = "Group", measure = "bray", ordination = "PCoA")Then we plot and compare the group distances.
# plot sample distances within groups
t1$cal_group_distance()
t1$plot_group_distance(distance_pair_stat = TRUE)# plot sample distances between groups
t1$cal_group_distance(within_group = FALSE)
t1$plot_group_distance(distance_pair_stat = TRUE)Clustering plot is also a frequently used method.
# use replace_name to set the label name, group parameter used to set the color
t1$plot_clustering(group = "Group", replace_name = c("Saline", "Type"))perMANOVA is often used in the differential test of distances among groups.
| Df | SumsOfSqs | MeanSqs | F.Model | R2 | Pr(>F) | |
|---|---|---|---|---|---|---|
| Group | 2 | 6.121 | 3.06 | 10.57 | 0.1955 | 0.001 |
| Residuals | 87 | 25.18 | 0.2895 | NA | 0.8045 | NA |
| Total | 89 | 31.3 | NA | NA | 1 | NA |
| Groups | measure | permutations | R2 | p.value | Significance |
|---|---|---|---|---|---|
| IW vs CW | bray | 999 | 0.1595 | 0.001 | *** |
| IW vs TW | bray | 999 | 0.147 | 0.001 | *** |
| CW vs TW | bray | 999 | 0.1556 | 0.001 | *** |
# manova for specified group set: here "Group + Type"
t1$cal_manova(cal_manova_set = "Group + Type")
# t1$res_manova$aov.tab| Df | SumsOfSqs | MeanSqs | F.Model | R2 | Pr(>F) | |
|---|---|---|---|---|---|---|
| Group | 2 | 6.121 | 3.06 | 12.01 | 0.1955 | 0.001 |
| Type | 3 | 3.783 | 1.261 | 4.949 | 0.1208 | 0.001 |
| Residuals | 84 | 21.4 | 0.2548 | NA | 0.6836 | NA |
| Total | 89 | 31.3 | NA | NA | 1 | NA |
trans_diff class
Differential abundance test is a very important part in the microbial community data analysis. It can be used to find the significant taxa in determining the community differences across groups. Currently, trans_diff class have three famous approaches to perform this analysis: metastat, rf and LEfSe[9]. Metastat[10] depends on the permutations and t-test and performs well on the sparse data. It is used for the comparisons between two groups.
# metastat Genus level
t1 <- trans_diff$new(dataset = dataset, method = "metastat", group = "Group", metastat_taxa_level = "Genus")
# plot the first paired groups, choose_group = 1
t1$plot_metastat(use_number = 1:10, qvalue = 0.05, choose_group = 1)LEfSe combines the non-parametic test and LDA and depends on the python. Now we rewrite this approach in this package.
t1 <- trans_diff$new(dataset = dataset, method = "lefse", group = "Group", alpha = 0.01, lefse_subgroup = NULL)
t1$plot_lefse_bar(LDA_score = 4)| Taxa | Group | pvalue | LDA |
|---|---|---|---|
| k__Bacteria|p__Proteobacteria | CW | 2.004e-11 | 4.852 |
| k__Bacteria|p__Acidobacteria | IW | 6.708e-12 | 4.786 |
| k__Bacteria|p__Acidobacteria|c__Acidobacteria | IW | 6.47e-13 | 4.784 |
| k__Bacteria|p__Bacteroidetes | TW | 5.139e-10 | 4.777 |
| k__Bacteria|p__Proteobacteria|c__Gammaproteobacteria | CW | 9.365e-12 | 4.628 |
We can also plot the abundance of taxa detected using LEfSe. If you are interested in the taxonomic tree, you can use metacoder package[Foster_Metacoder_2017] to plot the taxonomic tree based on the selected taxa. We do not show the usage here.
The third approach is called rf, which depend on the random forest[11, 12] and the non-parametic test.
t1 <- trans_diff$new(dataset = dataset, method = "rf", group = "Group", rf_taxa_level = "all")
t2 <- t1$plot_diff_abund(use_number = 1:20, only_abund_plot = FALSE)
gridExtra::grid.arrange(t2$p1, t2$p2, ncol=2, nrow = 1, widths = c(2,2))trans_env class
The environmental variables or physicochemical measurements are very useful in analyzing microbial community structure and assembly mechanisms. We first show the RDA analysis using both the dbrda and rda.
# add_data is used to add the physicochemical data
t1 <- trans_env$new(dataset = dataset, add_data = env_data[, 4:11])# use bray-curtis distance to do dbrda
t1$cal_rda(use_dbrda = TRUE, use_measure = "bray")
t1$trans_rda(adjust_arrow_length = TRUE, max_perc_env = 10)
t1$plot_rda(plot_color = "Group")# use Genus
t1$cal_rda(use_dbrda = FALSE, taxa_level = "Genus")
# As the main results of RDA is related with the projection and angles between different arrows,
# we adjust the length of the arrow to show them clearly.
t1$trans_rda(adjust_arrow_length = TRUE, max_perc_env = 1500, max_perc_tax = 3000, min_perc_env = 200, min_perc_tax = 300)
# show important genus
t1$plot_rda(plot_color = "Group")Mantel test can be used to check whether there is significant correlations between physicochemical variables and distance metrics.
| variable_name | cor_method | corr_res | p_res | significance |
|---|---|---|---|---|
| Temperature | pearson | 0.452 | 0.001 | *** |
| Precipitation | pearson | 0.2791 | 0.001 | *** |
| TOC | pearson | 0.13 | 0.003 | ** |
| NH4 | pearson | -0.05539 | 0.922 | |
| NO3 | pearson | 0.06758 | 0.049 | * |
| pH | pearson | 0.4085 | 0.001 | *** |
| Conductivity | pearson | 0.2643 | 0.001 | *** |
| TN | pearson | 0.1321 | 0.002 | ** |
The correlations between physicochemical variables and taxa are important in analyzing or inferring the factors affecting community structure. In this example, we first perform the differential abundance test and random forest analysis to obtain the important genus. Then we use those taxa to perform correlation analysis.
t2 <- trans_diff$new(dataset = dataset, method = "rf", group = "Group", rf_taxa_level = "Genus")
t1 <- trans_env$new(dataset = dataset, add_data = env_data[, 4:11])
t1$cal_cor(use_data = "other", p_adjust_method = "fdr", other_taxa = t2$res_rf$Taxa[1:40])Then, we can plot the correlation results using ggplot2 or pheatmap.
If you are concerned with the relationship between environmental factors and alpha diversity, you can also use this function.
t1 <- trans_env$new(dataset = dataset, add_data = env_data[, 4:11])
# use add_abund_table parameter to add the extra data table
t1$cal_cor(add_abund_table = dataset$alpha_diversity)
t1$plot_corr()trans_network class
Network has been a frequently used approach to study the co-occurrence patterns in microbial ecology[13–15]. In this part, we describe all the core contents in the trans_network class. The network construction approaches can be classified into two types: correlation-based and non correlation-based. Several approaches can be used to calculate correlations and significances. So we create a class called trans_corr to calculate correlations. trans_network class inherits the trans_corr class.
We first introduce the correlation-based network. The parameter cal_cor in trans_network controls the correlation calculation method.
# Use R base cor.test, slow
t1 <- trans_network$new(dataset = dataset, cal_cor = "base", taxa_level = "OTU", filter_thres = 0.0001)# Use R WGCNA package when the OTU number is large; very fast
t1 <- trans_network$new(dataset = dataset, cal_cor = "WGCNA", taxa_level = "OTU", filter_thres = 0.0001)# SparCC method, you can use nThreads parameter to increase CPU threads to accelerate the calculation, see trans_corr class documentation.
t1 <- trans_network$new(dataset = dataset, cal_cor = "SparCC", taxa_level = "OTU", filter_thres = 0.0001, SparCC_simu_num = 100)# contruct network; require igraph package; return t1$res_network
# use RMT to find optimal correlation coefficient
t1$cal_network(p_thres = 0.01, COR_optimization = TRUE)# use arbitrary coefficient threshold to contruct network
t1$cal_network(p_thres = 0.01, COR_cut = 0.7)# save network; open the file using Gephi
t1$save_network(filepath = "network.gexf")
# the following picture is an example; using gephi# calculate network attributes; return t1$res_network_attr
t1$cal_network_attr()
# t1$res_network_attr| Vertex | 1766 |
| Edge | 17940 |
| Average_degree | 20.32 |
| Average_path_length | 9.882 |
| Network_diameter | 17 |
| Clustering_coefficient | 0.6454 |
| Density | 0.01151 |
| Heterogeneity | 0.9755 |
| Centralization | 0.04628 |
# classify the node; return t1$res_node_type
t1$cal_node_type()
# we retain the file for the following use in trans_func part
network_node_type <- t1$res_node_type
# t1$res_node_type[1:10, c(1:4, 12:13)]| z | module | p | taxa_roles | degree | betweenness | |
|---|---|---|---|---|---|---|
| OTU_236 | 0.6822 | M1 | 0 | Peripheral nodes | 44 | 59 |
| OTU_32 | -0.4248 | M1 | 0 | Peripheral nodes | 22 | 117 |
| OTU_2733 | -0.9783 | M1 | 0 | Peripheral nodes | 11 | 23664 |
| OTU_4050 | 0.8332 | M1 | 0 | Peripheral nodes | 47 | 12753 |
| OTU_52 | 0.33 | M1 | 0.1352 | Peripheral nodes | 37 | 2466 |
| OTU_36 | 1.034 | M1 | 0 | Peripheral nodes | 51 | 3497 |
| OTU_2721 | -0.5255 | M1 | 0.4444 | Peripheral nodes | 20 | 30 |
| OTU_192 | -1.28 | M1 | 0.375 | Peripheral nodes | 5 | 0 |
| OTU_94 | -1.431 | M1 | 0 | Peripheral nodes | 2 | 0 |
| OTU_5862 | -1.381 | M1 | 0 | Peripheral nodes | 3 | 212 |
# calculate the links between or within taxonomic ranks
t1$cal_sum_links(taxa_level = "Phylum")
# plot the summed links; require chorddiag package
t1$plot_sum_links(plot_pos = TRUE, plot_num = 10)The subset_network() function can be used to take a part of nodes and edges among these nodes from the network. In this function, you should provide the nodes you need using the node parameter.
# this return a sub network that contains all nodes of module M1
t1$subset_network(node = t1$res_node_type %>% .[.$module == "M1", ] %>% rownames, rm_single = TRUE)We also introduce another network construction approach: Probabilistic Graphical Models (PGM), which is implemented in julia package FlashWeave[16]. So if you want to use this method like the following code, you should first install julia language in your computer and the FlashWeave package, and add the julia in the computer path.
trans_nullmodel class
To make the use of null model convenient, we incorporate several null models and phylogenetic analysis[17]. We first check the phylogenetic signal.
# generate object; use 1000 OTUs as example; use cpp to accelerate betaMPD calculation
t1 <- trans_nullmodel$new(dataset, taxa_number = 1000, add_data = env_data, cpp = TRUE)betaNRI(ses.betampd) is used to show the ‘basal’ phylogenetic turnover[18].
If we want to plot the betaNRI, we can use plot_group_distance function in trans_beta class. The results showed that the mean betaNRI of TW is extremely and significantly larger that those in CW and IW, revealing that the basal phylogenetic turnover in TW is high, if no strong dispersal limitation exists among the wetlands of different areas in China.
t2 <- trans_beta$new(dataset = dataset, group = "Group")
# must assign use_matrix value. This matrix is the default matrix to store the beta-diversity matrix and calculate group distance.
dataset$beta_diversity[["betaNRI"]] <- t1$res_ses_betampd
# create trans_beta class, use measure "betaNRI"
t2 <- trans_beta$new(dataset = dataset, group = "Group", measure = "betaNRI")
# transform the distance for each group
t2$cal_group_distance()
g1 <- t2$plot_group_distance(distance_pair_stat = TRUE)
g1 + geom_hline(yintercept = -2, linetype = 2) + geom_hline(yintercept = 2, linetype = 2)Sometimes, if you want to perform null model analysis for each group individually with a specific hypothesis, you can calculate the results for each group, respectively. We can find that, when we perform betaNRI for each group respectively, the mean betaNRI between CW and TW are not significantly different, and are both significantly higher than that in IW, revealing that the strength of variable selection in CW and TW may be similar under the condition that each area have a specific species pool.
# we create a list to store the trans_nullmodel results.
sesbeta_each <- list()
group_col <- "Group"
all_groups <- unique(dataset$sample_table[, group_col])
# calculate for each group, respectively
for(i in all_groups){
# like the above operation, but need provide 'group' and 'select_group'
test <- trans_nullmodel$new(dataset, group = group_col, select_group = i, taxa_number = 1000, add_data = env_data, cpp = TRUE)
test$cal_ses_betampd(runs = 500, abundance.weighted = TRUE)
sesbeta_each[[i]] <- test$res_ses_betampd
}
# merge and reshape to generate one symmetrical matrix
test <- lapply(sesbeta_each, melt) %>% do.call(rbind, .) %>%
reshape2::dcast(., Var1~Var2, value.var = "value") %>% `row.names<-`(.[,1]) %>% .[, -1, drop = FALSE]
# like the above operation
dataset$beta_diversity[["betaNRI"]] <- test
t2 <- trans_beta$new(dataset = dataset, group = "Group", measure = "betaNRI")
t2$cal_group_distance()
g1 <- t2$plot_group_distance(distance_pair_stat = TRUE)
g1 + geom_hline(yintercept = -2, linetype = 2) + geom_hline(yintercept = 2, linetype = 2)betaNTI(ses.betamntd) can be used to indicate the phylogenetic terminal turnover.
# null model run 500 times
t1$cal_ses_betamntd(runs=500, abundance.weighted = TRUE)
# t1$res_ses_betamntd[1:5, 1:5]| 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|
| 0 | -5.637 | -5.701 | -5.758 | -5.531 |
| -5.637 | 0 | -6.101 | -6.275 | -6.013 |
| -5.701 | -6.101 | 0 | -6.333 | -6.197 |
| -5.758 | -6.275 | -6.333 | 0 | -6.141 |
| -5.531 | -6.013 | -6.197 | -6.141 | 0 |
rcbray can be calculated use function cal_rcbray().
As an example, we also calculate the propotion of the inferred processes on the community assembly.
| process | percentage |
|---|---|
| variable selection | 4.419 |
| homogeneous selection | 48.71 |
| dispersal limitation | 0 |
| homogeneous dispersal | 8.739 |
| drift | 38.13 |
trans_func class
Ecological researchers are usually interested in the the funtional profiles of microbial communities. Because functional or metabolic data are more powerful to explain the structure and dynamics of microbial communities and to infer the underlying mechanisms. As metagenomic sequencing is complicated and expensive, using amplicon sequencing data to predict functional profiles is a good choice. Several software are often used for this goal, such as PICRUSt[19], Tax4Fun[20] and FAPROTAX[21, 22]. These tools are great to be used for the prediction of functional profiles based on the prokaryotic communities from sequencing results. Sometimes, it is also important to obtain the functions of each taxa or OTU, not just the whole profile of communities. But it is hard to know which OTU have what functions exactly. FAPROTAX database is a collection of the traits and characteristics of prokaryotes based on the books and literatures. We use it to analyze the biogeochemical roles of prokaryotes in this part.
# create object of trans_func
t2 <- trans_func$new(dataset)
# mapping the taxonomy to the database
t2$cal_spe_func()
# return otu_func_table, 1 represent function exists, 0 represent no or cannot confirmed.
# t2$otu_func_table[1:5, 1:2]| methanotrophy | acetoclastic_methanogenesis | |
|---|---|---|
| OTU_4272 | 0 | 0 |
| OTU_236 | 0 | 0 |
| OTU_399 | 0 | 0 |
| OTU_1556 | 0 | 0 |
| OTU_32 | 0 | 0 |
The percentages of the OTUs having one trait can reflect the functional redundancy and potential of this function in the community or the specific pattern of the modules in the network.
# calculate the percentages of OTUs for each trait in each modules of network
# use_community = FALSE represent calculating module, not community, node_type_table provide the module information
t2$cal_spe_func_perc(use_community = FALSE, node_type_table = network_node_type)
# we only plot some important traits, so we use the default group list to filter and show the traits.
t2$plot_spe_func_perc(group_list_default = TRUE)
# M represents module# calculate the percentages for communities
t2$cal_spe_func_perc(use_community = TRUE)
# t2$res_spe_func_perc[1:5, 1:2]| methanotrophy | acetoclastic_methanogenesis |
|---|---|
| 0.39 | 0.04 |
| 0.27 | 0 |
| 0.48 | 0 |
| 0.48 | 0 |
| 0.56 | 0 |
# then we try to correlate the res_spe_func_perc of communities to environmental variables
t3 <- trans_env$new(dataset = dataset, add_data = env_data[, 4:11])
t3$cal_cor(add_abund_table = t2$res_spe_func_perc, cor_method = "spearman")
t3$plot_corr(pheatmap = TRUE)We also calculate the functional potential of communities on the biogeochemical cycles by using FAPROTAX software.
t2 <- trans_func$new(dataset)
# This function will invoke the python 2.7
t2$cal_biogeo()
# t2$res_biogeo[1:10, 1:3]| 1 | 2 | 3 | |
|---|---|---|---|
| methanotrophy | 26 | 15 | 16 |
| acetoclastic_methanogenesis | 2 | 0 | 0 |
| methanogenesis_by_disproportionation_of_methyl_groups | 5 | 0 | 0 |
| methanogenesis_using_formate | 0 | 0 | 0 |
| methanogenesis_by_CO2_reduction_with_H2 | 23 | 2 | 5 |
| methanogenesis_by_reduction_of_methyl_compounds_with_H2 | 0 | 0 | 0 |
| hydrogenotrophic_methanogenesis | 23 | 2 | 5 |
| methanogenesis | 25 | 2 | 5 |
| methanol_oxidation | 12 | 2 | 6 |
| methylotrophy | 38 | 17 | 22 |
Tax4Fun have a strict input file demand on the taxonomic information. To use Tax4Fun to analyze the trimmed or changed OTU data in R, we provide a link to the Tax4Fun functional prediction.
t1 <- trans_func$new(dataset)
# require install Tax4Fun and download SILVA123 ref data from http://tax4fun.gobics.de/
# decompress SILVA123; provide path in folderReferenceData as you put
t1$cal_tax4fun_func(folderReferenceData = "./SILVA123")
# return two file: tax4fun_KO: KO file; tax4fun_path: pathway file.
# t1$tax4fun_KO$Tax4FunProfile[1:5, 1:2]| K00001; alcohol dehydrogenase [EC:1.1.1.1] | K00002; alcohol dehydrogenase (NADP+) [EC:1.1.1.2] |
|---|---|
| 0.0004823 | 5.942e-06 |
| 0.0005266 | 4.017e-06 |
| 0.0005054 | 6.168e-06 |
| 0.0005109 | 5.888e-06 |
| 0.0005083 | 5.547e-06 |
Now, we use pathway file to analyze the abundance of pathway.
# must transpose to taxa row, sample column
pathway_file <- t1$tax4fun_path$Tax4FunProfile %>% t %>% as.data.frame
# filter rownames, only keep ko+number
rownames(pathway_file) %<>% gsub("(^.*);\\s.*", "\\1", .)
# load the pathway hierarchical file
data(ko_map)
# make a microtable, is it kind of familar?
func1 <- microtable$new(otu_table = pathway_file, tax_table = ko_map, sample_table = t1$sample_table)
print(func1)microtable class: sample_table have 90 rows and 4 columns otu_table have 284 rows and 90 columns tax_table have 341 rows and 3 columns
Now, we need to trim data and calculate abundance.
func1$tidy_dataset()
# calculate abundance automatically at three levels: level_1, level_2, level_3
func1$cal_abund()
print(func1)microtable class: sample_table have 90 rows and 4 columns otu_table have 284 rows and 90 columns tax_table have 284 rows and 3 columns Taxa abundance: calculated for level_1,level_2,level_3
Then, we can plot the abundance.
# bar plot at level_1
func2 <- trans_abund$new(func1, taxrank = "level_1", groupmean = "Group")
func2$plot_bar(legend_text_italic = FALSE)We can also do something else. For example, we can use lefse to test the differences of the abundance and find the important enriched pathways across groups.
func2 <- trans_diff$new(dataset = func1, method = "lefse", group = "Group", alpha = 0.05, lefse_subgroup = NULL)
func2$plot_lefse_bar(LDA_score = 3, width = 0.8)Something important
clone
R6 class has a special copy mechanism different from S3 and S4. If you want to copy an object completely, you should use function clone
# use clone to copy completely
t1 <- clone(dataset)
t2 <- clone(t1)
t2$sample_table <- NULL
identical(t2, t1)[1] FALSE
# this operation is usually unuseful, because changing t2 will also affect t1
t2 <- t1
t2$sample_table <- NULL
identical(t2, t1)[1] TRUE
change object
All the classes were set public, meaning that you can change, add or remove the objects stored in them as you want.
Contact
If the reader has any question, feel free to join the QQ group: 277434916 for the discussion.
References
1. Oksanen J, Blanchet FG, Friendly M, Kindt R, Legendre P, McGlinn D, et al. Vegan: Community ecology package. 2019.
2. Paradis E, Schliep K. Ape 5.0: An environment for modern phylogenetics and evolutionary analyses in R. Bioinformatics 2018; 35: 526–528.
3. Kembel S, Cowan P, Helmus M, Cornwell W, Morlon H, Ackerly D, et al. Picante: R tools for integrating phylogenies and ecology. Bioinformatics 2010; 26: 1463–1464.
4. Mcmurdie PJ, Holmes S. Phyloseq: An r package for reproducible interactive analysis and graphics of microbiome census data. Plos One 2013; 8: e61217.
5. Caporaso JG, Kuczynski J, Stombaugh J, Bittinger K, Bushman FD, Costello EK, et al. QIIME allows analysis of high-throughput community sequencing data. Nat Methods 2010; 7: 335–336.
6. Schloss PD, Westcott SL, Ryabin T, Hall JR, Hartmann M, Hollister EB, et al. Introducing mothur: Open-source, platform-independent, community-supported software for describing and comparing microbial communities. Appl Environ Microbiol 2009; 75: 7537–41.
7. An J, Liu C, Wang Q, Yao M, Rui J, Zhang S, et al. Soil bacterial community structure in chinese wetlands. Geoderma 2019; 337: 290–299.
8. Mendes R, Kruijt M, De BI, Dekkers E, Van der VM, Schneider JH, et al. Deciphering the rhizosphere microbiome for disease-suppressive bacteria. Science 2011; 332: 1097–100.
9. Segata N, Izard J, Waldron L, Gevers D, Miropolsky L, Garrett WS, et al. Metagenomic biomarker discovery and explanation. Genome Biol 2011; 12: R60.
10. White J, Nagarajan N, Pop M. Statistical methods for detecting differentially abundant features in clinical metagenomic samples. PLoS Computational Biology 2009; 5: e1000352.
11. Beck D, Foster JA. Machine learning techniques accurately classify microbial communities by bacterial vaginosis characteristics. PloS one 2014; 9: e87830.
12. Yatsunenko T, Rey FE, Manary MJ, Trehan I, Dominguez-Bello MG, Contreras M, et al. Human gut microbiome viewed across age and geography. Nature 2012; 486: 222–227.
13. Deng Y, Jiang Y-H, Yang Y, He Z, Luo F, Zhou J. Molecular ecological network analyses. BMC bioinformatics 2012; 13: 113.
14. Faust K, Raes J. Microbial interactions: From networks to models. Nature Reviews Microbiology 2012; 10: 538–550.
15. Coyte KZ, Schluter J, Foster KR. The ecology of the microbiome: Networks, competition, and stability. Science 2015; 350: 663–666.
16. Tackmann J, Matias Rodrigues JF, Mering C von. Rapid inference of direct interactions in large-scale ecological networks from heterogeneous microbial sequencing data. Cell Syst 2019; 9: 286–296 e8.
17. Stegen JC, Lin X, Fredrickson JK, Chen X, Kennedy DW, Murray CJ, et al. Quantifying community assembly processes and identifying features that impose them. ISME J 2013; 7: 2069–2079.
18. Liu C, Yao M, Stegen JC, Rui J, Li J, Li X. Long-term nitrogen addition affects the phylogenetic turnover of soil microbial community responding to moisture pulse. Sci Rep 2017; 7: 17492.
19. Langille MG, Zaneveld J, Caporaso JG, McDonald D, Knights D, Reyes JA, et al. Predictive functional profiling of microbial communities using 16S rRNA marker gene sequences. Nat Biotechnol 2013; 31: 814–21.
20. Aßhauer KP, Wemheuer B, Daniel R, Meinicke P. Tax4Fun: Predicting functional profiles from metagenomic 16S rRNA data. Bioinformatics 2015; 31: 2882–2884.
21. Louca S, Jacques SM, Pires AP, Leal JS, Srivastava DS, Parfrey LW, et al. High taxonomic variability despite stable functional structure across microbial communities. Nature Ecology & Evolution 2016; 1: 0015.
22. Louca S, Parfrey LW, Doebeli M. Decoupling function and taxonomy in the global ocean microbiome. Science 2016; 353: 1272.